This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@kwen2501 (Contributor) commented Oct 3, 2024

TiktokenTokenizer (used by llama3) and SentencePieceProcessor (used by llama2) seem to have different requirements on input shape (1D list vs. 2D list).

This PR adds a condition to separate the input preparation for the two.

It also improves some logging.
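The separation described in this PR could be sketched roughly as follows. This is a minimal sketch, not torchchat's actual code: the stub tokenizer classes below stand in for the real `TiktokenTokenizer` and `SentencePieceProcessor`, whose decode signatures the PR description summarizes as 1D-list vs. 2D-list.

```python
import torch

class TiktokenTokenizer:
    """Stub standing in for torchchat's Tiktoken wrapper (llama3).
    Its decode accepts only a flat (1D) list of token ids."""
    def decode(self, ids):
        assert ids and isinstance(ids[0], int), "expects a 1D list of ints"
        return " ".join(f"tok{i}" for i in ids)

class SentencePieceProcessor:
    """Stub standing in for sentencepiece (llama2).
    Its decode accepts a batched (2D) list of token ids."""
    def decode(self, ids):
        return [" ".join(f"tok{i}" for i in row) for row in ids]

def decode_responses(tokenizer, res):
    # res: list of (batch, chunk) tensors produced during generation
    res = torch.cat(res, dim=1)   # concatenate chunks -> (batch, total_seq_len)
    res_list = res.tolist()       # nested (2D) Python list
    if isinstance(tokenizer, TiktokenTokenizer):
        # Tiktoken wants a 1D list, so decode each batch row separately.
        return [tokenizer.decode(row) for row in res_list]
    # SentencePiece handles the batched 2D list directly.
    return tokenizer.decode(res_list)
```

Either branch yields one decoded string per batch row; only the shape passed to `decode` differs.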

@kwen2501 kwen2501 requested a review from lessw2020 October 3, 2024 08:02
@pytorch-bot bot commented Oct 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1257

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e54c8d1 with merge base 34d6831 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 3, 2024
    res = torch.cat(res, dim=1)
    res_list = res.tolist()
    responses = tokenizer.decode(res_list)
    if isinstance(tokenizer, TiktokenTokenizer):
nit - I think this type of check is better done directly in the _build_chat_tokenizer function; then we just have an enum for the tokenizer type and can correctly err out on an unrecognized tokenizer.
The reason for that is twofold:
a - we future-proof ourselves, so that if llama4 ships a new tokenizer we are not going back through the code trying to figure out where all the 'tokenizer' type checks are and updating them.
b - we only check once, at a logical point; all other code then uses the enum, and we have a single upfront failure point to err out if we hit an unrecognized tokenizer.
As this code currently stands, it assumes that if the tokenizer is not tiktoken it must be sentencepiece, which is a brittle assumption long term.
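The enum-based check suggested in this comment could look roughly like the sketch below. The `TokenizerType` enum, the `classify_tokenizer` helper, and the class-name-based detection are all illustrative assumptions, not torchchat's actual implementation.

```python
from enum import Enum

class TokenizerType(Enum):
    TIKTOKEN = "tiktoken"            # used by llama3
    SENTENCEPIECE = "sentencepiece"  # used by llama2

def classify_tokenizer(tokenizer) -> TokenizerType:
    """Classify the tokenizer once, at build time, and err out on
    anything unrecognized -- the single upfront failure point the
    review comment describes."""
    name = type(tokenizer).__name__
    if name == "TiktokenTokenizer":
        return TokenizerType.TIKTOKEN
    if name == "SentencePieceProcessor":
        return TokenizerType.SENTENCEPIECE
    raise ValueError(f"Unrecognized tokenizer: {name}")
```

Downstream code would then branch on the returned enum value rather than repeating `isinstance` checks, so adding a hypothetical llama4 tokenizer means touching only this one function.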

@lessw2020 (Contributor) left a comment

thanks for the update!
I left a comment about making the tokenizer check more robust long term, but this is fine to land.

@lessw2020 lessw2020 merged commit 32241ff into main Oct 3, 2024
52 checks passed
@kwen2501 (Contributor, Author) commented Oct 4, 2024

Thanks @lessw2020. Do you know why TiktokenTokenizer cannot decode a 2D list?
